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SUMMARY 


This paper describes a new technique of passive ranging which is based on utilizing the image- 
plane expansion experienced by every object as its distance from the sensor decreases. This 
technique belongs in the feature/object-based family. 

The motion and shape of a small window, assumed to be fully contained inside the bound- 
aries of some object, is approximated by an affine transformation. The parameters of the 
transformation matrix are derived by initially comparing successive images, and progressively 
increasing the image time separation so as to achieve much larger triangulation baseline than 
currently possible. Depth is directly derived from the expansion part of the transformation. 

To a first approximation, image-plane expansion is independent of image-plane location 
with respect to the focus of expansion (FOE) and of platform maneuvers. Thus, an expansion- 
based method has the potential of providing a reliable range in the difficult image area around 
the FOE. In areas far from the FOE the shift parameters of the affine transformation can provide 
more accurate depth information than the expansion alone, and can thus be used similarly to the 
way they have been used in conjunction with the Inertial Navigation Unit (INU) and Kalman 
filtering. However, the performance of a shift-based algorithm, when the shifts are derived from 
the affine transformation, would be much improved compared to current algorithms because the 
shifts — as well as the other parameters — can be obtained between widely separated images. 

Thus, the main advantage of this new approach is that, allowing the tracked window to 
expand and rotate, in addition to moving laterally, enables one to correlate images over a 
very long time span which, in turn, translates into a large spatial baseline — resulting in a 
proportionately higher depth accuracy. 


ACRONYMS USED IN TEXT 


FOE 

- Focus of Expansion 

FOV 

- Field of View 

INU 

- Inertial Navigation Unit 

LOS 

- Line of Sight 

OF 

- Optical Flow 

PSF 

- Point-Spread-Function 

SNR 

- Signal-to-Noise Ratio 

PSR 

- Peak-to-Sidelobes Ratio 

TBD 

- Track Before Detect 

TTC 

- Time To Collision 

3-D 

- 3-Dimensional 

AFTR 

- Affine Transformation 

CW, CCW 

- Clockwise, counterclockwise 
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1 INTRODUCTION 


Passive ranging is an area of considerable interest for applications such as obstacle avoidance 
for rotorcraft nap-of-the-earth navigation and spacecraft landing. Two main passive- ranging 
methods can potentially be employed for this purpose; one based on motion and the resulting 
image-plane optical flow, and the other based on stationary stereo. Both methods can be 
thought of as special cases of a more general triangulation method known as “bearing-only” or 
“direction-of-arrival” (e.g., [1, 2, 3, 4]). In this paper we chose to concentrate on monocular 
OF-based ranging. 

The motion of an imaging sensor causes each imaged point of the scene to correspondingly 
describe a time trajectory on the image plane. The trajectories of all imaged points constitute 
what’s called the “optical flow” (OF). A forward-looking imaging sensor, such as a TV camera 
or a Forward-Looking Infra-Red (FLIR), is typically used as the source of optical flow data. 
The various methods of extracting depth information from the OF can be classified as belonging 
into three main classed as we did in [5]. The method described in this paper can be considered 
object-based or feature-based depending on the definition of features. If the features are chosen 
based on some local image property, such as texture or edge, then we are dealing with a feature- 
based method. If, the feature is chosen through some pre-segmentation to be wholly contained 
inside a physical object, then our new technique can be considered to be object-based; that is 
how we chose to regard it in this paper. 

Like all other passive-ranging methods, we assume that the scene and its illumination 
sources are temporally constant (see [6]). We also assume that all points belonging to the 
same object share the same range. In [5] we differentiated between detect-then-track and track- 
before-detect (TBD) algorithms (akin to filtering and smoothing respectively) and pointed out 
the advantages of the TBDs in terms of SNR-performance and robustness (see [7, 8, 9]). We 
will return to this subject after presenting our new algorithm, and show that it is a TBD one. 

The OF at any given point in the image plane consists of three kinds of motion: lateral 
translation, expansion (or divergence), and rotation (curl). When considering a window of some 
finite size, one can approximately describe its time evolution by an affine transformation which, 
in the most general case, has six parameters: four belonging to the 2x2 multiplying matrix, 
and two belonging to the vector of lateral translation. Most depth-estimation methods, such as 
described in [10, 11], make use of the lateral motion alone. Two basic limitations are implicit 
in these methods. First, they cannot perform in the image plane close to the FOE, and second, 
they can only use a relatively short triangulation baseline because far-apart images would not 
correlate due to the misadjustment in the other components of the affine transformation (besides 
the shifts) which are not accounted for. “Triangulation baseline” is the term we use for the 
distance the platform travels between the frames to be correlated. As we have shown in our 
earlier work [12], the depth-error is inversely proportional to the triangulation baseline (see (18) 
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ahead). 


In this work I will discuss methods of extracting depth information from the divergence of 
the OF as it is approximately obtained from the affine transformation matrix. I use the term 
“divergence” (or “local divergence”) to refer to the mathematical definition of the derivative- 
vector operator denoted by V which, in this case, scalar- multiplies the velocity vector at a point. 
Divergence is thus defined for an infinitesimal area and time. We use “expansion” , or “global 
divergence”, as a short-hand for the “rate of area expansion” to denote the average divergence 
over the area of some finite-size window or of an object. We will soon see why the divergence of 
the velocity vector, V • v(p) at some point p actually measures the rate of area expansion (which 
explains the above proliferation of terms). 
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Figure 1: Texture and size cues. 


The idea of using divergence as a source of depth information is not new. The works 
of Longuet-Higgins and Prazdny [13], Prazdny [14, 15], Koenderink [16], Koenderink and van 
Doom [17, 18], and Nelson and Aloimonos [19] elaborate quite extensively on this subject. 
Recently, an interesting extension to these works was reported by Ringach and Baram in [20]; 
although it is field-based, it explicitly assumes that the scene is composed of objects (defined 
by their borders) and derives the global divergence for all objects without the need to actually 
delineate or identify them. The local- and global-divergence methods are intended for different 
kinds of objects as exemplified in figure 1 . The local-divergence method is intended for textured 
objects with no well-defined edges, whereas the global- divergence method is intended for objects 
with little or no texture but having well-defined edges. In this paper we rely upon the objects 
being textured, so our algorithm roughly derives the equivalent of local divergence. 

If we examine a window centered on the FOE, its translational motion is zero by definition, 
but it still expands as the depth decreases, and this expansion is left as the only source of 
depth information. Thus, there are two new aspects to our work; one is the direct derivation 
of depth from expansion, and the other is enabling the use of a long triangulation baseline for 


4 




even using just the conventional translation-based methods (as well as for the expansion-based 
ones). The later feature is the one that transforms this algorithm from a track-then-detect 
to a TBD, because the accuracy (or SNR) of the final result is based on the largest available 
baseline, as opposed to (Kalman) filtering of results that were individually obtained based on a 
single-interframe baseline. 

This is why one can consider this work to represent an extension of the existing translation- 
based algorithms such as the one developed by Sridhar, Phatak, and Cheng in [10, 11] and 
Sridhar, Suorsa and Hussien in [21] which derive the image-plan translations of “points of 
interest” (small windows) through spatial cross- correlation between consecutive images and 
subsequent Kalman filtering of their image-plane trajectories. 

In order to round off the picture, we also need to refer to another closely- related area of 
research represented by the work of Merhav and Bresler (see [22, 23, 24, 25]). The first three 
papers primarily address image-plane motion estimation, which is, of course, equivalent to depth. 
Also, they rely upon the assumption (that we do not need to make) that the image statistics in 
the X and Y directions are separable. The fourth paper suggests a stochastic-gradient approach 
to image-plane motion estimation which can be thought of as a precursor of the work reported 
here. 

As a last comment, it is noteworthy that utilizing divergence (or expansion) for depth 
derivation has been largely motivated by advances in the understanding of visual processing 
in humans and primates. For example, experiments with humans suggest the existence of 
divergence (looming) detectors in the human visual system [26, 27, 28] as well as vorticity 
detectors [28, 29, 30]. 

The organization of this report is as follows. Section 2 contains the theory relating Diver- 
gence, Expansion, and Depth. Section 3 presents the idea of using the affine transformation to 
relate objects in different frames. Section 4 presents simulation results. Section 5 presents the 
practical algorithm that iterates over increased frame separation. Section 6 discusses the error 
analysis. 


2 OPTICAL FLOW, DIVERGENCE AND EXPAN- 
SION 


The basic equations for the divergence in the image plane are derived in this section. This 
derivation is based on prior work described in [13] to [20]. 

It is convenient to think for a moment of imaging the outside scene onto a spherical surface 
because such projections are identical irrespective of the camera-axis direction. In fact, with 
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Figure 2: The geometry of projection onto the image plane. 

such geometry, the camera axis is defined to coincide with the line-of-sight (LOS) from the center 
of the sphere to any imaged point as seen in figure 2. Another motivation for regarding the image 
plane as a sphere is that this geometry is similar to that of imaging the world by a lens onto 
a spherical retina, e.g., in the human eye. Let us define the coordinate system of the spherical 
camera to have its origin at the sphere’s center and its Z axis to pass through the imaged point 
P of some object. The sphere is defined to be of unity radius. Consider the projection of P 
onto the sphere at point p. At that point define the origin of an (U, V) plane tangent to the 
sphere which is called local projective image plane (image plane, for short); this image plane 
approximates the sphere at the point of tangency. Let us assume that P is found on a smooth 
surface described by some function z — f(x, y) so that its gradient Vz = (zx,zy) exists. The 
distance of any point on that surface from the sphere’s center can then be approximated in the 
neighborhood of P by r 

z « z 0 + Vz • (x,y) , (1) 

where Zo is the distance between points 0 and P. The relative motion of the camera with respect 
to the scene is defined by its translational velocity V = (Vy, Vy , Vz) and rotational velocity u> = 
(wy,a )y,uz)- It is convenient to normalize V by z 0 and define (vx,vy,vz) = (Vy, Vy, Vz)/z 0 . 

The motion of the camera causes the stationary point P and its surrounding to describe 
a retinal velocity field (or optical flow ) around p on the image plane. We denote image-plane 
projections by (u,t>), to correspond with ({/, V), and their temporal derivatives by (u<,i>t). 
Thus, the image-plane velocity vector at p is defined as v(p) = (u t ,v t ) | p . The spatial partial 
derivatives of (uj, v t ) are denoted by u tx , Ut y , v tx , Vt v . From [13] we know that the following 
equations hold at p, 

t it = —vx —wy, v t — ~ v y + wy , 
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Utx =V Z + v x zx , 


Vty = Vz + VyZy , 

v tx = — 1 w z + vyzx 


( 2 ) 


Uty = w z 4- v x zy , 


Using the above equations, the divergence at p (denoted div(p)) can be expressed as 

div(p) = V • v(p) = u tx + v ty = 2v z + Vz • (v*, vy ) (3) 

To interpret the above equation, suppose that the camera only moves in the Z direction. In 
that case v x = vy = 0 and V • v(p) = 2v z = 2 V z /z 0 , that is, div(p) is twice the reciprocal 
of the time-to-collision (TTC) of P with the camera’s center. Because of this interpretation, 
div(p) is termed ^immediacy” in [16] and other papers, that is, it measures the immediacy of 
an imminent collision. In the opposite case, when (v x , vy ) ^ (0,0) and v z = 0, there can still 
be a relative depth change between the camera and the patch because it is generally slanted. In 
other words, div(p) will still have the same interpretation as before, except that the imminent 
collision is going to be with some point on the plane tangent to the patch at P and not with the 
point P itself. Thus both terms of the immediacy have a valid physical interpretation. Notice 
that the rotational velocities do not appear in div(p). This is a very important (and well-known) 
observation because it says that the TTC information is wholly contained in the imagery; no 
additional information is needed (such as from the INU). 

Nelson and Aloimonos describe in [19] a straight-forward mechanism for evaluating the 
divergence from a sequence of images. In practice, this algorithm can only provide a rough 
estimate of the local divergence. 


The global divergence is defined (see [20]) as the average divergence over the area of each 
object, and denoted by x(R) for an object whose projection onto the image plane is R (assuming, 
for the moment, that its boundary dR is well defined). Thus, 

x(R) = mb div(p) i 3 =mL v ' v(p) is= m L v(p) ■ n d ' • < 4 > 

where A(R) is the object area, ds is the elemental area, dl is the elemental length along dR , n 
is a unit vector normal to dR, and the equality is based on the divergence theorem. In words, 
the average divergence equals the line integral of the normal component of the velocity vector 
at the edge along the edge of the object. This line integral can easily be shown (see [20]) to 
have an intuitive interpretation, that is, 


x(R) = 


1 d A(R) 
A(R) d t 


(5) 


i.e., the global divergence equals 


the temporal rate of change of the normalized object area. 


To find the relationship between global divergence, expansion, and time-to-collision, con- 
sider the similar-triangles equation relating the image-plane projection at (u, 0) of some point 
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(6) 


similar to P but located at (x = /,y — 0,z — zq ) in figure 2, 

/ 

u = — 

^0 

Taking the derivative of u with respect to Zq , we find that 

du = d Zq = —Vzdt = UV Z dt , (7) 

Zo Zo 


Thus, 


If we repeat this derivation for an area 
would find, using d A/ A = 2d u/u, that 


1 du 

— rr = v z 


(8) 


u dt 

change, dA, rather than for a length change, du, we 


hft = ** (9) 

Comparing (9) to (5), it is seen that \ has the interpretation of twice the TTC. Thus, the 
normalized (by the area) temporal rate of change of the projected area A of some object, that 
is, its rate of area expansion , equals twice the TTC. 


3 ESTIMATING THE RATE OF EXPANSION 

In this section we introduce the affine transformation, and develop the algorithm necessary to 
estimate the object’s rate of expansion. 

3.1 The affine transformation 

The affine transformation (AFTR) can be used to relate object’s projections at different frames 
(or times); its most general form is defined by six parameter. However, we intuitively judged 
that four parameters should suffice because they directly convey the physically-interpretable 
changes one would expect to occur. We thus define our specific AFTR by 


u 


■ 09 

—99 ' 

U — Uq 


a + u 0 

V 

— s 

S9 

09 

v — Vo 

+ 

b + v o 


where s is a scaling (or expansion) factor, 09 = cos (9) and S9 = sin(0), and 6 is the angle by 
which the object in h is CW rotated with respect to its original orientation in I 0 . Thus, this 
AFTR maps points ( u,v ) from one frame (7 0 ) onto the corresponding points (ti,u) in another 
frame (h). In figure 3 we notice that, first, the object expanded about 50%, second, it rotated 
about 25° CCW, and third, it moved up and right. This is indeed the order of mappings conveyed 
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Figure 3: Mapping of a point through the affine transformation. 

by the above definition although the order of scaling and rotation is immaterial. Notice that 
scaling and rotation is performed around the arbitrarily-defined center point of the object located 
at (tto,uo), and shifting is performed later — back to the original center point plus an incremental 
shift by the vector (a, 6). 


3.2 Vehicle’s maneuvers and image-plane motion 



Figure 4: Window. 


In this subsection we calculate the transformation that an object's projection undergoes as a 
result of platform maneuvers so as to relate it with the AFTR as defined above. To do that, we 
start from the well-known equations for the temporal derivatives of the image-plan projections 
(tt,v). Repeated as in [21], and adapted to our earlier notation, we have 

.2 


uv 


it* 


U t = —fvx + UVz + UX ~J /^(l + ~p) + vu> z 


uv . tr. 

Vt = —JVy + Wz -OJy — + JUJxyl + -p) - UVz , 


( 11 ) 


where / is the focal length. Now consider the shifts experienced by the corners of the window 
shown in figure 4. The differences between their shifts can be used to yield rotation and expan- 
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sion. The rotation of the upper side of the square (where u x = v 0 ) during some interframe time 
can be approximated by 

Auj — Avo t wk 


ui - u 0 


= —UJz - 


f 


( 12 ) 


When the point (uo, Uo) coincides with the FOE, this reduces to —u>z- The rotation of the left 
side of the square (where uq = U2) is similarly found as 


Au 2 - Au 0 

= -uj z 7 — 

v Q -v 2 J 


(13) 


which also reduces to —u>z at the FOE. We have used the rotations of vertical and horizontal 
lines to show that, first, they rotate slightly differently, that is, in principle, the square distorts, 
and, second, this rotation approximately equals the platform roll. Comparing the two terms on 
the right of (12) for equal platform roll and yaw, the yaw term is smaller by a factor of f/v 0 . At 
a distance of, say, 50 pixels from the FOE, and with f=622 (using our camera as an example), 
this factor is 12.4. Since the expansion-based algorithm suggested here is intended to mainly 
enhance depth derivation around the FOE, we conclude that image-plane rotation is reasonably 
approximated by platform roll. 


by 


Next, let us analyze the expansion factor. For the upper side of the square it is approximated 


A«i - Au 0 v 0 u>x («o + «i)u^ 

= v z + — 7 7 

- «o / / 


(14) 


and for the left side of the square by 


Auo — Al>2 
v 0 - v 2 


UqUy (t>0 + v 2 )u) X 

f f 


(15) 


At the FOE, both expressions approach vz as the square size goes to zero. Again, horizontal and 
vertical lines expand slightly differently, but, to a good approximation, this expansion equals vz 
(the TTC). The superfluous terms are an order of magnitude smaller than vz for areas up to 50 
pixels from the FOE and small angular speeds. 


Our conclusion is that, over the expected range of flight scenarios, the affine transformation 
represents a good approximation to the actual mapping that is talcing place between different 
frames. If this approximation is not adequate, one can always use the full 6 degrees of freedom 
available in the general affine transformation. 


3.3 What happens when scaling and rotation are ignored 


In this subsection we elaborate on the importance of using the AFTR even for an algorithm 
which calculates depth based on the shifts alone. Ignoring the AFTR amounts to taking it to be 
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Average peak-to-sidelobe-ratio 



Figure 5: Average peak-to-sidelobes ratio as a function of image size for different distortions. 

Normalized registration -error 



Figure 6: Registration-error standard deviation as a function of image size for different distor- 
tions. 
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a unity matrix. This question has been investigated quite extensively by Mostafavi and Smith 
in [31, 32]. For completeness, we summarize their results here. 


For images having a circularly symmetric Gaussian correlation function, 


R{t u ,t v ) = exp 



(16) 


where r u , r v are the spatial shifts, and A the “correlation width”, the effects of non-compensated 
rotation (by 0) and/or scaling (by s) are determined by the combined geometric-distortion 
parameter <f, 


d = yf\ 1 - 2s cos 0 + s 2 | « y^(l - s) 2 + 0 2 for small 8 and s«l (17) 

Figure 5 shows the effect of d on the peak-to-sidelobes ratio (PSR). Peak stands for the maxi- 
mum value of the cross-correlation function between two frames, and “sidelobes” stands for the 
standard deviation of the cross-correlation function far from the peak. The reference image is 
taken as a square of size L x L. The other (sensed) image is much larger than the reference 
image. In the figure, L appears normalized by the correlation width because what counts is the 
effective number of “independent” image objects. Six graphs are shown for different d values. 
The graph for d = 0.087, for example, can be used for rotation alone (of 5°), or for scaling alone 
(s = 1.087), or for any of their combinations such that (17) yields d = 0.087. Figure 6 similarly 
shows the behavior of the registration error. 

Let us use the following example to demonstrate the effect of uncompensated rotation or 
scaling errors. Take speed Vx = 25 m/s, depth zq = 120 m ; image-plane location 10 pixels from 
the FOE, a rolling maneuver of u>z = 20°/s, L = 21, A = 1.5 pixels, and frame rate of 2 fr/s. 
This low frame rate is used to achieve a large triangulation baseline as will be explained later. 
Only two consecutive frames are used in this example. 

In a single interframe time the platform rotates 10° and there is an expansion by a factor 
of s = 120/(120 - 25 • 0.5) = 1.1163, so that d = 0.21. The PSR will incur a loss of « 3 (6 
dB in PSR power) — as read from figure 5. This is why, without using the AFTR, one needs to 
use a higher frame rate, say, 10 fr/s. The registration error, as extrapolated from figure 6, will 
increase from 0.025 to 0.070 pixels. In [12] we have found the depth error: 


cr z 


bu 


(18) 


where b is the triangulation baseline. Thus, the depth error incurred by a geometrically- 
compensated algorithm (b = 12.5 m) is 4.1 m while that incurred by a non-compensated al- 
gorithm (fe = 2.5 m) is 57 m (out of 120 m !). 


This example shows that, even in the conventional shift-based algorithm, neglecting to 
compensate for the AFTR in the process of cross-correlation is costly in two ways. First, it 
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either degrades the PSR which may hinder locking onto the correct peak (false alarm) or impose 
a short 6, and second, even when correct peak detection is achieved, the depth error would 
increase around tenfold. 


3.4 Converging on the correct affine transformation 


In this subsection we derive the equations and algorithm necessary to obtain the correct affine 
transformation. The basic idea is to use Newton’s equation (see [33]) iteratively to converge from 
the initially-assumed transformation matrix into the correct one by minimizing an appropriate 
error measure, or cost-function. 

We thus start by defining the cost-function, J, as the integral over the window area, A, of 
the squared difference of image gray levels, that is, 

e = h(u,v) - Io(u,v ) ; J ^^A.U A e2dA ( 19 ) 

If all points (u, v) inside the window (defined in Jo) are correctly mapped into (u, v) of Ji, then 
the above cost should equal zero. In practice, however, we can only expect to minimize this 
cost albeit not to drive it to zero. Our plan is to find the gradient and second derivatives 
of J so that we can use Newton’s method to solve for the minimum assuming that the cost- 
function is quadratic in the four parameters to be estimated. Since this assumption only holds 
approximately, it is necessary, in practice, to iterate a few times until the solution converges. 
The iterative update equation for the estimated parameter vector X(k) becomes 

X(k + 1) = X(k) - {V 2 J[X(A:)]} _1 VJ[X(k )] , (20) 

where 

*(*) = 


a 

b 

s 

e 


( 21 ) 


The four components of the cost-function gradient are calculated next. Starting with the 
first shift-parameter, a, 


dJ 1 ff de 1 ff dl x 

da~ AJJA C da iA AJjA^da 6 ’ 


( 22 ) 


because only the Ii(u,v) part of e depends on a through u, v. Developing that relationship, 


dh _ dlidu dhdv 
da du da A dv da 


(23) 
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Similar equations are obtained for the other three parameters by substituting them in place of a 
in (23). The above four equations require the partials of u, v with respect to all four parameters. 
These are obtained by differentiating the two scalar equations obtained from (10), that is, 


so that 


u = s[Q?(u — u 0 ) — Sfl(u - u 0 )] + «o + a 
v = s[S0(u — uo) + Q9(u — v 0 )] + v 0 + b , 


(24) 


du 

da 


= 1 


£-» 

^ = 0 9(u — «o) — S0(u — uo) 
Os 

= -s[S0(u - u 0 ) + 09(u - vo)) 


dv 

da 


= 0 


dv 

db ~ 

^ = S0(u — u 0 ) + O0(v - u 0 ) 

Os 

^ = s[Q?(u — «o) — 99(t> — Vo)] (25) 


We now need the ten second derivatives of the symmetrical matrix V 2 J[X(A:)]. In order 
to simplify notation, we will drop the “dA” from the integrals, the subscript 1 from 7, and the 
hats from u,v; these will now be understood whenever not specified. Let us start with one of 
the mixed second derivatives, say that of a and 0. We thus have 

d 2 J _ d_ (dJ\ _ J M dedt d_ 
da\d0j A J Ja da dO 6 da 


dadO 

After some more algebra, we get 

<!jL = L If + v—— + w 

dadO AjJa dadO dadO 
where 


dl du dl dv 

dZd6 + ai'B 


(26) 


du dv du dv 
da dO dO da 


+ c 


dl d 2 u dl d 2 v 


du dadO dv dadO 




+ e 


a 2 / 

du 2 ; 




r dl_ 

dv 


\ 2 d 2 I 

J +e a^ ; 


w A 
W = t 


dim_ 

dudv du dv 


(27) 


(28) 


The other mixed second derivatives of J are similar and can be obtained by substituting the 
other parameters in place of a and 0 in (28). The second (non-mixed) derivatives can, of course, 
be obtained by substituting the same parameter twice. For example, the second derivative of J 
with respect to a is 


d 2 J 

da 2 




+ e 


dl d 2 u dl d 2 v 

+ 


du da 2 dv da 2 


(29) 


Notice that the above equations require two kinds of building blocks; these are the first and 
second (also mixed) spatial derivatives of the 1\ image as well as the first and second (also mixed) 
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derivatives of u and v with respect to the four transformation parameters. The image spatial 
derivatives are calculated by convolving it with a simple Sobel-operator-type 3x3 window. The 
first derivatives of u and v where already calculated for the gradient as in (25). Differentiating 
(25) once again yields 10 second derivatives for u and 10 for v. Out of these 20, all turn out to 
be zero except the following four: 

5S - - )-<*-«.)—!; - a*. - -) - sx. - *) - 1 

= «[-<»(.. - u„) + £»(» - «,)] = ^ = -«[»(« - «„) + C9(v - Do)] = ^(30) 

At this point all the components necessary for a single iteration on the Newton’s solution have 
been derived. 


4 SIMULATIONS OF THE COST-FUNCTION AND 
ITS DERIVATIVES 


We now want to examine the behavior of the cost-function and its derivatives as a function 
of the four parameters in open loop, that is, without trying to correct the errors yet. For the 
following experimental results we used simulated imagery where the scene is composed of a wall 
normal to the initial flight trajectory. This wall is painted with a random Gaussian colored 
noise having spatial correlation width of 2 pixels in each of the two spatial dimensions. In this 
section we describe the main features of our Flight/Vision simulation and the open-loop error 
measurements. 


4.1 Flight/Vision simulation 

We have developed a simple simulation that enables us to generate a sequence of images (im- 
agery) as obtained from an optical sensor that travels and maneuvers as prescribed. This 
simulation is described here. 

The scenery is composed of a flat wall oriented normal to the initial LOS. The gray levels 
of the wall are derived by passing a white Gaussian noise through a two-dimensional Gaussian- 
shaped low-pass filter of some desired width. The wall is densely sampled by “wall-pixels” 
which, when imaged onto the camera’s focal plane, are much finer than the “chip-pixels” of the 
camera. Typically 25 wall-pixels fall inside a single chip-pixel at the beginning of the run; this 
is chosen so that the wail can approximately be considered continuous. The correlation width of 
the low-pass filter above is chosen in terms of equivalent chip-pixels. In all simulations described 
later we chose correlation width of 2 chip-pixels because that is a typical width for the lens’ 
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point-spread-function. The number of wall-pixels impinging on each chip-pixel is proportional 
to the depth squared because the camera uses a fixed angular field of view. Thus, to maintain 
a constant wall brightness on the image plane, we have to factor the wall brightness (or gray- 
level) by the depths-squared inverse. This compensation is nothing more than simulating the 
dependence of light radiation (power-per-area) on the inverse of the range squared. 

The camera is initially located across from (and pointing to) the wall center at a distance 
of zq m. It is generally flying towards the wall center and can perform any desired maneuvers 
on its way. Each ray from the center of a wall-pixel to the camera’s focal point (in world 
coordinates) gets transformed into the camera’s coordinates through the 3x3 rotation matrices 
corresponding to yaw, pitch, and roll ( e.g see [34]). The camera coordinates of the ray are used 
in the projection equations to yield the image coordinates of the ray’s piercing point, that is, 

« = /-; v = l~, ( 31 ) 

z z 

We now assume a point-spread-function (PSF), having the shape of a chip-pixel and centered on 
the (non-integer) (it, v) point, to impinge upon the grid of chip pixels. This is where interpolation 
becomes necessary. 



Figure 7: The interpolation method. 

The method of interpolation is explained with the help of figure 7. The (u,u) point falls at 
a distance of (Su, Sv ) from some integer point (uo, Vo). We thus assign the PSF areas intersected 
by each of the 4 chip-pixels to these pixels. The corresponding areas are thus assigned as follows: 

(1 - <5u)(l - Sv) to pixel (u 0 ,t;o) 5 

(1 - Su)Sv to pixel ( u 0 ,v 0 + 1) ; 

6u(l - Sv) to pixel (ii 0 + 1, wo) ; 

SuSv to pixel (u 0 + + 1) , (32) 

where (Su, Sv) are derived as 

Su = u — int(u) ; Sv = v — int(w) , (33) 
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and int(-) is a function that rounds off its argument to the nearest lower integer. 

As we have said above, there are, around 25 such partial contributions into every chip 
pixel — each contribution resulting from the center of a wall-pixel being projected onto some 
different (u, v) point. We have found that choosing the ratio between the sides of a chip-pixel 
and a wall-pixel to equal 5, using a chip-pixel-size PSF, and interpolating as prescribed above, 
results in a realistically-appearing textural behavior of the wall as it gets closer to the camera 
during the simulated flight. Examples showing the time evolution of the imaged wall will be 
shown in the sequel. 


4.2 Simulation of the error equations 

The error equations are, in principle, simulated as prescribed by equations (19) to (30). However, 
since we are deeding with a spatially-discretized images, it is necessary to implement these 
equations in a discrete form as well. There are no conceptual problems associated with replacing 
integrals by summations. However, all we know about the real physical image values comes from 
the pixels’ gray-level data. It is important to understand that the gray-level of a pixel represents 
the value of a double integral over its area (average), where the spatially-continuous radiation 
emanating from the scene serves as the integrand. Another way to put it is that each pixel 
collects all the photons impinging anywhere within its boundaries during its integration time 
(interframe time). 

Differentiating between a pixel’s gray-level and the actual value of the scene at any (contin- 
uous) location on the image plan is important in estimating the scene values h(u, v) as required 
in (19) because (u, v) are generally non-integers. There is no such problem in estimating 7o(it, u) 
because, by definition, we start from the pixel’s center (integer) and thus take its gray-level as 
the best estimate of the scene value at this pixel’s center. For the estimation of h(u, v), we 
use an interpolation method that looks identical to the one used for the imagery generation, 
although it is conceptually completely different. 

Referring once more to figure 7, here is the problem. Say we have an estimate for the value 
of the scene at the center point of some initial pixel, that is, we have Io(uo,vo). This point has 
been mapped into location (u,t>) in image I\, and we want to estimate 7 1 (u,u). The relevant 
information available from image I\ is its pixel values for the four pixels shown in the figure 
because these are directly affected by the original scenery patch (of pixel size). We can think of 
the value of each such pixel as a random variable crosscorrelated with 7o(«o 5 ^o) in proportion 
with the intersected areas as defined by (32). This led us to use the rather ad hoc interpolation 
method: 

h(u,v) = (1 — 6u)(l - 6v)Ii(uo,vo) + £u(l - Su)Ii(u 0 ,v 0 + 1) 

+6u(l — 6v)Ii(u 0 + 1, t>o) + Su6vli(u 0 + l,t> 0 + 1) (34) 
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This method has the advantage that it yields the expected results when (u,v) take on integer 
values, and it provides a continuous estimate inside the convex hull defined by the values of the 
four nearest pixels. The same interpolation method is used for estimating the image values as 
well as their first and second derivatives. 


4.3 Open-loop error measurements 

In the first set of open-loop error simulations we investigated the error sensitivity to the scaling 
factor s in isolation as a function of window size . The flight trajectory used for this set is 
non-maneuvering and constant-velocity towards the center of the wall starting from a depth of 
150 m at a speed of 1 m/fr. The set of 3 images (number 0, 12, and 24) are shown in figure 8 to 
demonstrate the effect of expansion as the depth decreases from 150 to 138 to 126 m. Figure 9 
shows the case of a 11 x 11 window size which is centered on the FOE. The first and fifth frames 
are used for Iq and /i respectively so that the baseline is b — 4 m. The figure shows four curves. 
Three curves belong to the cost-function and its first and second derivatives as derived in the 
previous section. The fourth curve shows the correction for s as calculated by the Newton s 
algorithm of (20), that is, the third component of {V 2 3 J[X(fc)]} _1 VJ[X(k)]. The four graphs 
in each figure are scaled as necessary for convenient presentation. Figure 10 and figure 11 only 
differ from figure 9 by the window size as indicated in their titles. Figures 12 and 13 represent 
contraction — as opposed to expansion — and they serve to verify symmetry in comparison with 
figures 10 and 11 respectively. 

The following observations are noteworthy. 

1. The absolute values of all four variables increase monotonically with the window size. The 
reason is that, since the free variable is an expansion factor, it causes each pixel of the 
window to shift in linear proportional to its distance from the center of the window. Thus, 
the larger the window, the larger are the shift errors experienced by its pixels. 

2. The values of the cost-function and its first and second derivatives roughly agree; this 
is not obvious because each derivative is obtained directly from the corresponding image 
derivatives. Low-pass-filtering of the image derivatives and the fact that we deal with 
discrete pixel values and have to resort to interpolation, can account for the numerical 
disparities. 

3. The actual value of s, to be denoted s a , is shown by the vertical bars in all figures. It 
is noticed that, in all 5 cases it falls closer to the minima of the cost-functions than to 
the zero crossings of the first derivatives. We do not have a satisfactory explanation for 
this behavior except to assume that these are noise-like inaccuracies resulting from the 
quantization and interpolation operations; they clearly diminish as the window becomes 
larger. It warrants commenting here that it is the zero crossing of the derivative which 
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Figure 8: Frames 0, 12, and 24 of simulated textured wall seen while flying forward. 
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Cost and its sensitivity to scale factor 



Figure 9: Sensitivity of the cost-function and its derivatives to the scale factor (11x11 window). 


Cost and its sensitivity to scale factor 



Figure 10: Sensitivity of the cost-function and its derivatives to the scale factor (21 x 21 window). 
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Cost and its sensitivity to scale factor 



Figure 11: Sensitivity of the cost-function and its derivatives to the scale factor (41 x41 window) 


Cost and its sensitivity to scale factor 



Figure 12: Sensitivity of the cost-function and its derivatives to the scale factor (21 x 21 window) 
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Figure 14: Interpolation example. 
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matters and not the minimum of the cost-function because that is where the closed-loop 
system would converge to. 

4. The second derivative shows a sharp slope change at s = 1; the first derivative and the cost- 
function itself show corresponding behavior. The reason for that is explained by analyzing 
our interpolation method as shown in figure 14 for a simple one-dimensional case. The 
black dot represents the center point, (u, v), of one of the Iq pixels that got shifted — as a 
result of expansion by some factor s > 1 — to its new location in image I\. The rectangle 
centered on the dot represents the original Jo pixel. This new location is shifted by Su 
with respect to where it would fall if s equaled unity. Let us take the gray-level of this 
particular Io pixel as unity with all its neighbors being zeroes. This pixel will cause the 
gray-levels of image I\ to become 

G 0 = 0 ; 6?i = (1 — 6u ) ; G 2 = 6u (35) 


In order to generate the error curves, we sweep the value of s over some range around 
3 = 1. The lower rectangle in the figure represents the location of the corresponding swept 
pixel for some s > 1 (denoted by s a ) which is different from the actual s a . This swept 
pixel is shown shifted by 6s. Interpolating for the current value of s = s s , we have 

/j(u, t>) = Gi(l - 6s) + G 2 6s = (1 - 6u){ 1 - 6s) + 6u6s = 1 - 6s - 6u + 2 6u6s (36) 
When s sweeps through values less than unity, i.e., s a < 1, we have 

h(u,v) = Gi(l- | 6s |) + Go | 6s |= Gi(l- | 6s |) = (1 - 6u)( 1- | 6s |) , (37) 

which is always less than the corresponding result for a positive 6s. 

We thus conclude that, for an expansion, when the actual s is larger than 1, sweeping s 4 
over values of s 4 > 1 always results in h(u, v) larger than those resulting from symmetrical 
(around s = 1) values of 3 , < 1. This effect becomes more pronounced as the window 
size increases because the window pixels are, on the average, farther from its center and 
they experience larger 6u shifts. When the actual s is smaller than 1, we see the opposite 
behavior as exemplified by figures 12 and 13. In these, the actual s is s a = 146/150 = 
0.9733. It is important to realize that, since the closed-loop algorithm performs around s„ 
and not around s = 1, it is not affected by the above phenomenon. 


5. The curves of ds give the calculated correction for the case where the error occurs (through 
sweeping) in s alone. In such a case, the correction part of equation (20) simplifies to 


ds = 


dJ/ds 

d 2 J/ds 2 


(38) 


It can be seen from the figures that ds approximately agrees with this equation. Also, the 
discontinuities in the first and second derivatives at s = 1 cancel each other in (38) so that 
the ds graphs do not show any discontinuity. 
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Figure 15: Frames 0, 4, and 8 of simulated textured wall as seen while rolling with no lateral 
motion. 
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Cost and its sensitivity to rotation 



Figure 16: Sensitivity of the cost-function and its derivatives to rotation (11 x 11 window). 


Cost and its sensitivity to rotation 



Figure 17: Sensitivity of the cost-function and its derivatives to rotation (21 x 21 window). 
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Cost and its sensitivity to rotation 



Figure 18: Sensitivity of the cost-function and its derivatives to rotation (41 x 41 window). 


Cost and its sensitivity to rotation 



Figure 19: Sensitivity of the cost-function and its derivatives to rotation, 21 x 21 window, 
positive rotation. 
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In the next set of error measurements we investigated the error sensitivity to rotation angle 
0 in isolation as a function of window size. For this set the camera does not travel laterally; it 
only rolls at —0.02 rad/fr while pointing towards the center of the wall from a constant depth 
of 150 m. The set of 3 images (number 0, 4, and 8) axe shown in figure 15 to demonstrate the 
effect of rotation. Figure 16 shows the case of a 11 x 11 window size which is centered on the 
FOE. Figures 17 and 18 correspond to windows of size 21 and 41 respectively. The first and 
sixth frames axe used for Jo and I\ respectively so that the total roll used in generating the first 
3 figures is of —0.1 rad. Figure 19 shows a roll in the opposite direction for a symmetry check. 
The same four curves as before are shown in all figures. 

The following observations can be made. 


1. The absolute values of all four variables increase monotonically with the window size. The 
reason here is the same that applied to the scaling-only cases. The larger the window the 
laxger the shifts experienced by pixels which axe farther from the window center. 

2. The values of the cost-function and its first and second derivatives roughly agree as for 
the s curves. 


3. The actual value of 6 is shown by the vertical bars in the figures. It is noticed that the 
bars fall close to the minima of the cost-functions and also to the zero crossings of the first 
derivatives. The laxger the window, the more accurate these results axe. 


4. There are no marked discontinuities as found in the s curves because the reason that 
caused it there does not apply here. 

5. The curves of d 6 give the calculated correction for the case where the error occurs (through 
sweeping) in 0 alone. In such a case equation (20) simplifies to 

dJ/d6 


d9 d 1 2 J/d0 2 

In the figures d# approximately agrees with this equation. 


(39) 


In the next set of error measurements we investigated the error sensitivity to image-plane 
shifts, a, in isolation as a function of window size. For this set the camera is stationary except 
that it is panning at 0.0005 rad/f while pointing towards the center of the wall from a constant 
depth of 150 m. Images numbers 0 and 4 are used for J 0 and Ji respectively. The panned images 
axe not shown because they look quite indistinguishable — being shifted only by about a pixel. 
Figure 20 shows the case of a 21 x 21-size window (top) and 41 x 41-size window (bottom) when 
both axe centered on the FOE. The following observations can be made. 

1. As opposed to the previous cases, where s or 0 served to generate the errors, here there is 

very little sensitivity to the window size because the shifts are equal for all pixels within 
the window of any size. 
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Cost and its sensitivity to shift 



Figure 21: Sensitivity of the cost-function and its derivatives to shift over a wide range (21 x 21 
window). 


2. The actual value of a is marked by the vertical bars in all figures. These bars fall close to 
the minima of the cost-functions and also to the zero crossings of the first derivatives. As 
before, the larger the window, the more accurate these results are. 

3. The second-derivative discontinuities at the integer pixel shifts can be explained by argu- 
ments similar to those used in the case of the s curves. 


4. The curves of da give the calculated correction for the case where the error occurs (through 
sweeping) in a (or b) alone. In such a case, the correction part of equation (20) simplifies 
to 

dJ/da 
° = d 2 J/da 2 

In the figures, da approximately agrees with this equation. 



5. Figure 21 shows the behavior of the cost-function curve for large shifts — where it becomes 
highly non-linear. The Newton’s solution loses much of its value at such large errors. 
However, convergence is still possible inside the error region defined by the nearest zero- 
crossing of the first derivative on either side of the zero-error point (±4 pixels here). Inside 
this region the correction still shows the right sign. 
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4.4 Closed-loop performance 

In this subsection we summarize the results of closed-loop runs. These runs are divided into 
four groups. The first three groups parallel the open-loop cases of forward-flying, rolling, and 
panning (yaw). In the fourth run there are maneuvers in all variables so we could test the 
most general case. Within each group there are two kinds of parameters. One parameter is the 
window size, and the other is the location of the window with respect to the FOE. 

In each run the errors axe corrected using the Newton’s method for six iterations. Theo- 
retically, Newton’s method should “converge” in one shot for any ideal parabolic cost-function. 
We allow for discrepancies from the ideal by (1) iterating on the solution more than once, (2) 
factoring the corrections by an experimental factor of 0.75 to prevent overshoots, and then, (3) 
bounding 6s by ±0.03, 66 by ±0.03 rad, and 6a, 6b by ±0.75 pixels. 

Each of the graphical results for all runs include five curves to show the convergence of 
the cost-function, J, and the four parameters: s, 6, a, and b. In addition, there are four bars 
(arbitrarily located between iteration number 4 and 5) whose ordinates show the ground-truth 
values of the four parameters for ready visual comparison. The bars are marked by the parameter 
symbols. 


Convergence of error & parameters 



Iteration number 


Figure 22: Convergence for forward flying and no maneuvers at the FOE (21 x 21 window). 

Let us start with the results for forward-flying with no maneuvers. The initial depth is 150 
m and the velocity is 1 m/fr towards the center of the wall. The transformation parameters 
are calculated at the time of frame number 4 by comparing it to frame number 0 (skipping the 
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Convergence of error & parameters 



Figure 23: Convergence for forward flying and no maneuvers at (20,20) from the FOE (21 x 21 
window). 
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Figure 24: Convergence for forward flying and no maneuvers at the FOE (41 x 41 window). 
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Convergence of error & parameters 



Figure 25: Convergence for forward flying and no maneuvers at (20,20) from the FOE (41 x 41 
window). 

intermediate frames). These runs are intended to demonstrate expansion alone for a window 
centered on the FOE, and expansion-plus-shift for a window centered on the point (20,20) with 
respect to the FOE. The following observations can be made: 

1. The cost-function and all parameters practically converge in two iterations. When no 
parameter correction hits its bounds, convergence is achieved in a single iteration. 

2. The accuracies — especially for s — improve noticeably as the window size doubles (4 times 
the window area), but they are still very good for the 21 x 21-size window. For example, 
from figure 22, the correct expansion (indicated by the s bar) is 150/146=1.0274, which 
corresponds to 146 frames-to-collision, whereas the converged value is s = 1.0296 which 
corresponds to 135 frames-to-collision. 

3. The converged shifts for the (20,20) point practically show no error. This is especially 
impressive because these shifts are small — only (0.548,0.548) pixels. 

Next, we present the results for roll-only flying without any forward or lateral motion. The 
depth is constant at 150 m. The transformation parameters are calculated at the time of frame 
number 2 by comparing it with frame number 0. the roll-angle difference is 0.04 rad between 
these two frames. In these runs we demonstrate rotation alone for a window centered on the 
FOE, and rotation-plus-shift for a window centered on the point (20,20) with respect to the 
FOE. The following observations can be made: 
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Figure 26: Convergence for roll-only maneuver at the FOE (21 x 21 window). 
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Figure 27: Convergence for roll-only maneuver at (20,20) from the FOE (21 x 21 window). 
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Convergence of error & parameters 



Figure 28: Convergence for roll-only maneuver at the FOE (41 x 41 window). 
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Figure 29: Convergence for roll-only maneuver at (20,20) from the FOE (41 x 41 window). 
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1. As before, the system practically converge within two iterations. 

2. Although the cost-function — especially in figure 26 — does not converge as close to zero as 
in all other case, the parameters still converge accurately to their respective values. 

3. The accuracies improve noticeably as the window size doubles. From figure 24 and fig- 
ure 25, 6 virtually has zero error, while its error increases to 3.6% for the 21 x 21 window. 

4. The expansion shows a transient for the (20,20) point, but it settles to zero after 2 itera- 
tions. 

5. The converged shifts at the (20,20) point are remarkably close to the correct ones of 
(0.8, 0.8) pixels. 


Convergence of error & parameters 



Figure 30: Convergence for yaw-only maneuver at the FOE (21 x 21 window). 

Next, we present the results for yaw-only flying with no forward or lateral motion. The 
depth is constant at 150 m. The transformation parameters are calculated at the time of frame 
number 4 by comparing it with frame number 0; the yaw-angle difference is 0.002 rad. We 
translate this yaw angle by using the fact that, in our Flight/Vision simulation, the camera’s 
FOV is taken as 10 degrees, and it corresponds with an image of size 128 x 128. This means that 
the expected shift is 6u = 1.467 pixels. Thus, in these runs, we demonstrate <5u-shift alone for 
a window centered on the FOE or on the point (26,26) with respect to the FOE. The following 
observations can be made: 
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Figure 31: Convergence for yaw-only maneuver at (26,26) from the FOE (41 x 41 window). 

1. Irrespective of the window size, or the location of the image-point with respect to the 
FOE, all converged parameters are close to being error free, 

2. The expansion and rotation show transients which decay to zero after two iterations. 

Lastly, we present the results for a general maneuver where the velocity is 1 m/s (starting 
from 150 m depth), pitch and yaw rates axe 0.0005 rad/s each, and the roll-rate is 0.02 rad/s. 
The transformation parameters are calculated at the time of frame number 2 in figures 32, 33, 
and 35, and at frame number 4 in figure 34 by comparison with frame number 0. The following 
observations can be made: 

1. The system converges within two iterations. 

2. Generally, the accuracies improve with the window size. 

3. The accuracy of s is around 6% for the FOE point — irrespective of the window size (21 to 
61) — and it drops to 16% for the (20,20) point. 


Convergence of error & parameters 



Iteration number 


Summarizing the simulation results, we can conclude that the basic idea and algorithm 
are solid and perform very well. Although these simulations were done in apparently noise-free 
situation, they do get affected by the noise inherent in the pixel quantization. 
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Convergence of error & parameters 



Figure 32: Convergence for general maneuvers at the FOE (21 x21 window, 2-frames difference). 


Convergence of error & parameters 



Figure 33: Convergence for general maneuvers at (20,20) from the FOE (21 x 21 window, 2- 
frames difference). 
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Figure 34: Convergence for general maneuvers at (20,20) from the FOE (21 x 21 window, 4- 
frames difference). 
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Figure 35: Convergence for general maneuvers at the FOE (61 x 61 window, 2-frames difference) 




5 INCREASING THE TRIANGULATION BASE- 
LINE 

In this section we use the above algorithm a s the core on which a farther layer is to be built 
with the intention of increasing the accuracy and robustness of the practical algorithm. The 
implicit assumption here is that the flight trajectory is basically non-maneuvering, or, in other 
words, it is the maneuvers which will determine the maximum usable triangulation baseline. 

5.1 The capture zone 


Normalized Correlation peak vs. shift 



Shift/window- width 


Figure 36: Average normalized correlation peak vs. shift, Delta in image-width fraction. 

We have touched on the question of convergence in regard to figure 21. In that figure the “capture 
zone” is of ±4 pixels — meaning that, as long as the error is within this zone, it always has the 
correct sign to drive it towards the stable solution. Thus, convergence is assured inside this 
zone, although its width is not usually known — especially when more than a single parameter is 
involved. It is possible, however, to estimate some lower bounds on the capture zone for each one 
of the four parameters. Estimating the width of the capture zone is based on the bandwidth or 
correlation width of the images. For that, we used A = 1.5 pixels in conjunction with figures 5, 
6, 36, and 37. What it means is that image-plane locations 1.5 pixels apart have gray- levels 
correlated with a correlation coefficient of exp{— 0.50} = 0.606 (see (16). 
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Average peak-to-sidelobe-ratio vs. shift 



0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 

Shift/ window-width 


Figure 37: Average peak-to-sidelobe-ratio vs. shift, Delta in image- width fraction. 

To estimate the capture zone, we arbitrarily assume that a PSR=7.5 is acceptable to provide 
a high enough probability of detecting the correct correlation peak and a low enough probability 
of false alarm (locking onto a wrong peak). This figure is equivalent to 15 dB in power ratios. 
Let us assume that the window size is 21 x 21; then A of 1.5 pixels is ~ 0.07 of the window-size. 
From the corresponding graph in figure 37 we read that a PSR=7.5 is achieved for shifts less 
than 0.063 of the window size, i.e., ±1.32 pixels. Repeating this exercise for A = 2 would result 
in a smaller capture zone of only 0.97 pixels. 

A word about figures 36 and 37 is now in place. Figure 36 shows that the correlation peak 
drops slowly with the shift when A is large — as expected from (16). However figure 37 shows 
that the higher the A, the higher the PSR’s initial value is, and the sharper its drop. This 
result is attributed to the fact that, when A is large, the effective number of independent image 
areas (objects) decreases. That has no effect on the mean correlation peak but it increases 
the sidelobes variance. The sidelobes variance of the cross-correlation, C(t u ,t v ), is given by 
equation (A19) of [31], 

var{C(r u , r v )} = L ~ 2 /+~ J+~ v)R{u, v)R(u , u)dudu + 

L~ 2 f-£f-£ 9{u, v)R(u - r„, v - t v )R(u + r Ui v + r„)dudu , (41) 


where 



(l-\u/L\)(\-\v/L\), 

0 


(u,v) € [~L,L] x [~L,L\ 
otherwise , 


(42) 
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is a triangular window that weighs the integrand, L 2 is the window area used for normalization, 
and (u,t)) are the transformed (u,v) of (10). Far from the crosscorrelation peak, only the first 
term of (41) prevails, and that is what we used for constructing the above figures. 

To estimate the capture zone for the expansion factor, s, and the rotation, 6, we refer back 
to figure 5. For the same case L/A = 14, and we see from the figure that this is achieved with 
d = 0.148 which is equivalent to 8.5° of rotation or s = 1.148 of expansion. Overall, we have 
shown that the capture zone is quite wide, and there is some optimal window size that can be 
chosen for any given correlation width. Images from real scenes are highly non-stationary in the 
sense that A might be small for one part of the image and large for another. However it can 
never be smaller than the PSF which is why we used A = 1.5 as a PSF- width estimate. 


5.2 The iterative algorithm 

In the iterative algorithm we start with frames which are close enough in time to ensure that 
the errors in the four parameters fall inside the worst-case capture zone. Let us say that we 
initially use frame-0 and frame- 1, so the frame separation is one. The Newton’s equations are 
iterated upon until the error converges. The converged parameters are then used to predict the 
initial values for a larger frame separation, say, between frame-0 and frame-4 (notice that the 
first frame of the pair is fixed here). The same is now repeated for this new frame separation. 
Thus, there are two nested iteration loops; the inner one iterates on the Newton’s equations 
un til convergence is achieved for some fixed frame separation; the outer loop iterates through 
increased frame separation. The algorithm can be summarized by the following pseudo code. 

frame.separation * 1; 
f rame.O = 0; 

frame. 1 = frame.O + frame.separation; 

while(frame < last.frame) { 

while (error has not converged) { 
solve Newton’s eqs. to update 
a, b, s, theta; 

> 

if (final error is low) { 
increase frame separation; 
frame.l = frame.O + frame.separation; 
save last parameter values; 

predict initial parameter values for new frame separation; 

} 

else -( declare previous-iteration results as final; } 
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> 


When running in batch for some fixed number of frames, the outer iteration loop must stop 
when the frame separation cannot be increased any more. Another condition to stop is that the 
converged results of the last frame-separation are not satisfactory, as judged by some criteria. 


The prediction of initial parameter values for the next (larger) frame separation is calculated 
from the converged parameters of the previous frame separation using the projection equations 
(31). Let us project an object of length l onto the image plane so that its projection is defined 
as unity. After decreasing the depth from Zq to Z\ , the projection changes to sj. For a frame 
separation of < 1 , that can be written as 

1 = — ; si = — ; zi = zq — Vzt\ , (43) 

Zo z 1 


from which 


si 


Zj 0 . 

Zo — VzU 


z o 


»\VzU 

Si - 1 


Rewriting the last equation for some S 2 , ^2 instead of for Si, t\, and solving for S 2 , we get 


(44) 


s 2 


Sl*l 

h - Sl(*2 - *l) 


(45) 


This is how the current expansion estimate (for the current frame separation) is used to 
predict the expansion estimate for a larger frame separation, The other three parameters are 
predicted based on linear extrapolation, so that 

02 = ; &2 = biti/ti ] &2 — (46) 

After the algorithm stops, (44) is used to calculate the current best estimate of the initial depth 
Zq based on the last pair of s;, t, which corresponds to the largest triangulation baseline that 
yielded convergence. 


5.3 Performance of the iterative algorithm 

First we ran the iterative algorithm on our simulated imagery, and then on some real imagery. 

Let us start with a typical run on the simulated imagery. It is a non-maneuvering, forward- 
flying case with velocity of 2 m/s. The first frame pair is made up of frame-0 and frame-2. The 
window of size 21 x 21 is initially centered on pixel (74,74) which is 10 pixels away from the 
FOE (which is at (64,64)) in u and v. There are 40 frames in the set. The following screen 
output reports progress in the estimation of the initial depth of 150 m. Each table-like block 
of numbers reports the convergence of the inner loop for the current frame separation. The 
inner-loop iteration number is k and the error is denoted by err. 
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Opened forward.flying frame 0 
Opened forvard.flying frame 2 


k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s, theta err 
k,a,b,s,theta err 
k,a,b,s , theta err 


0 0.000000 0.000000 1.000000 0.000000 267.672333 

1 0.440236 0.406381 1.030000 -0.005372 141.133240 

2 0.236247 0.248784 1.021997 0.000912 79.217903 

3 0.293832 0.275524 1.020823 -0.000823 83.259949 

4 0.278676 0.270192 1.020453 -0.000340 82.172791 

5 0.283285 0.271448 1.020556 -0.000496 82.425667 

6 0.281893 0.271135 1.020525 -0.000446 82.349678 

7 0.282312 0.271216 1.020533 -0.000462 82.372147 

8 0.282187 0.271194 1.020531 -0.000457 82.365257 

9 0.282224 0.271201 1.020531 -0.000458 82.367386 

10 0.282213 0.271199 1.020531 -0.000458 82.366638 


Current estimate of initial depth = 198.825650 


Opened forward.f lying frame 5 

k, a,b,s,theta err * 0 0.705533 0.677997 1.052959 -0.001145 74.528183 
k,a,b,s,theta err * 1 0.708271 0.690493 1.069563 -0.000779 55.280674 
k,a,b,s, theta err * 2 0.712015 0.685594 1.065833 -0.000990 55.973583 
k,a,b,s,theta err = 3 0.710209 0.685661 1.066386 -0.000896 55.718666 
k,a,b,s, theta err = 4 0.710757 0.685674 1.066294 -0.000920 55.761593 
k,a,b,s,theta err = 5 0.710619 0.685669 1.066310 -0.000915 55.753265 
k,a,b,s, theta err ■ 6 0.710652 0.685670 1.066307 -0.000916 55.754974 
k,a,b,s,theta err = 7 0.710644 0.685671 1.066308 -0.000916 55.754597 


Current estimate of initial depth = 160.812401 
Opened forward.f lying frame 10 


k,a,b,s,theta err * 0 1.421288 1.371342 1.142033 -0.001831 73.417000 
k,a,b,s,theta err ■ 1 1.565485 1.555250 1.153044 0.000626 28.408524 
k,a,b,s , theta err = 2 1.531177 1.529034 1.151938 -0.000007 27.785374 
k,a,b,s,theta err = 3 1.540096 1.533858 1.152508 0.000302 27.518284 
k,a,b,s .theta err = 4 1.537325 1.532594 1.152262 0.000157 27.572006 
k,a,b,s,theta err - 5 1.538222 1.532950 1.152353 0.000221 27.549091 
k,a,b,s,theta err «= 6 1.537919 1.532843 1.152320 0.000194 27.556232 
k.a.b.s, theta err * 7 1.538024 1.532876 1.152332 0.000205 27.553633 


k,a,b,s, theta err = 8 1.537986 1.532865 1.152328 0.000200 27.554605 
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k,a,b,s,theta err = 9 1.538000 1.532869 1.152329 0.000202 27.554232 


Current estimate of initial depth * 151.294483 


Opened f orward_flying frame 16 


k,a,b,s,theta err = 
k,a,b,s,theta err * 
k,a,b,s,theta err = 
k,a,b,s,theta err * 
k,a,b,s,theta err = 
k,a,b,s,theta err = 
k,a,b,s,theta err = 
k,a,b,s,theta err = 
k,a,b,s, theta err = 
k,a,b,s,theta err = 


0 2.460800 2.452590 1 

1 2.784964 2.769740 1 

2 2.683191 2.691992 1 

3 2.703243 2.702418 1 

4 2.697401 2.699811 1 

5 2.698910 2.700440 1 

6 2.698529 2.700282 1 

7 2.698624 2.700322 1 

8 2.698600 2.700312 1 

9 2.698606 2.700315 1 


268244 

0.000323 

122. 

269186 

-0.001672 

24. 

270512 

0.000779 

17. 

268434 

0.000409 

16. 

268962 

0.000566 

16. 

268833 

0.000553 

16. 

268864 

0.000553 

16. 

268857 

0.000553 

16. 

268859 

0.000553 

16. 

268858 

0.000553 

16. 


Current estimate of initial depth * 151.021904 


Opened forward.flying frame 22 

0.000760 268 
0.000956 30 
0.001404 8 

0.000450 8 

0.000596 8 

0.000568 8 

0 . 000574 8 

Current estimate of initial depth = 149.947839 


k,a,b,s,theta err 
k,a,b,s,theta err 
k.a.bjSjtheta err 
k,a,b,s,theta err 
k,a,b,s, theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 


0 3.710583 3.712933 1.411131 

1 4.293034 4.237146 1.417225 

2 4.125841 4.143232 1.415531 

3 4.139481 4.153663 1.415352 

4 4.137228 4.152442 1.415281 

5 4.137534 4.152607 1.415305 

6 4.137496 4.152581 1.415299 


Opened forward.flying frame 28 


k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 
k,a,b,s,theta err 


0 5.265904 

1 6.015904 

2 5.956887 

3 5.961518 

4 5.960965 

5 5.961043 

6 5.961033 


5.285103 

1 . 596075 

-0.000731 

522 

6.035103 

1.601842 

-0.001724 

25 

5.961528 

1.600377 

0.000404 

18 

5.965533 

1.599840 

0.000178 

18 

5.965257 

1.599872 

0.000228 

18 

5.965267 

1.599865 

0.000224 

18 

5.965264 

1.599866 

0.000224 

18 


735245 

976557 

046618 

436787 

535572 

499908 

508102 

506060 

506615 

506516 


700623 

175304 

244106 

060862 

049483 

049872 

049752 


.403687 

.510233 

.381941 

.452717 

.443174 

.443617 

.443457 
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Figure 38: Depth convergence with iterations (increased triangulation baseline). 

Current estimate of initial depth * 149.354233 

Opened f orward.flying frame 34 

k,a,b,s,theta err * 0 7.238398 7.243535 1.835851 
k,a,b,s,theta err ■ 1 7.988398 7 . 993535 1 . 834242 
k,a,b,s, theta err = 2 8.284004 8.308021 1.832178 
k,a,b,s, theta err « 3 8.285615 8.306216 1.832047 
k,a,b,s,theta err ■ 4 8.285925 8.305839 1.831991 
k,a,b,s, theta err * 5 8.285982 8.305765 1.831979 
k,a,b,s, theta err = 6 8.285994 8.305748 1.831976 
k,a,b,s,theta err ■ 7 8.285996 8.305745 1.831976 

Current estimate of initial depth ■ 149.733168 

Final estimate of initial depth = 149.733168 

There are a few interesting observations to make: 

1. Frame-0 is always used as the basis for comparison — initially with frame-2, then with 
frames 5, 10, 16, 22, 28, and 34. The depth estimate improves with the frame separation 
as shown in figure 38 

2. Notice that the first line of each block represents the initial conditions for a, b, s, and 6. 
In the first block, these are 0.0, 0.0, 1.0, 0.0 because we do not know any better. The 
last line of each block represents the converged values which are used to predict the initial 
conditions for the next block. 

3. The error in each block starts from some value and usually drops and stabilizes. If the 
initial guess falls far from the minimum but inside the capture zone, then the error starts 


0.000272 938.172974 
-0.009328 124.417595 
-0.000755 30.430639 
-0.000582 30.408218 
-0.000556 30.394806 
-0.000549 30.392294 
-0.000548 30.391562 
-0.000548 30.391680 
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from a large value and drops sharply. If the initial guess happened to be good, then the 
errors are already “converged”; this is exemplified by the second block belonging to the 
frame pair (0,5). 

4. The final result was obtained from the image pair (0,34) — which does not necessarily repre- 
sent the maximum frame separation possible. We have thus effectively used a triangulation 
baseline of 68 m which constitutes a substantial fraction of the initial depth of 150 m. This 
is the reason why we regard this algorithm as a track-before-detect one. In this example, 
the accuracy of the final result is 0.178 percent. 

We have run the algorithm on various other simulated cases — at and around the FOE. Generally, 
the depth accuracies are better than 2%, and they improve as we get closer to the FOE. 



Figure 39: The first “newline” image. 


We now present real-data cases from our imagery set “newline"; the first image of this 
sequence is shown in figure 39. The scene is that of a runway with a few surveyed trucks. The 
images are of size 512 x 512, the speed is 30.17 ft/s, and the frame rate 30 per second. There 
are only minor maneuvers in this flight. The convergence curve is shown in figure 40 for the 
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Figure 40: Depth convergence with iterations for “newline” leftmost truck. 

leftmost truck which is at depth of 405 ft. The frame pairs used are: frame-0 with 2, 5, 10, 16, 
22, 28, 34, and 40. Each iteration uses the next-larger frame separation. The converged depth 
resulting from the algorithm is 368, so that the accuracy here is of 9%. For the farther truck on 



012943«7«f 


Figure 41: Depth convergence with iterations for “newline” leftmost truck. 

the left, the algorithm ran ten iterations (last frame pair was (0,52)) and converged on a depth 
of 583 ft, where the ground-truth depth is 655 ft — accuracy of 11%. The convergence curve is 
shown in figure 41. The objects in these two examples show very little texture, and they are 
also small and far (TTC ~ 10 s) which may explain why the algorithm does not perform that 
well. Still the results can be considered satisfactory. 


6 ERROR ANALYSIS 


In this section we analyze the depth error as achieved by combining the depth results from 
lateral translation and those from expansion. We have already discussed the accuracy of the 
depth derived from lateral translation which is given by (18) where a u is given by figure 6. 
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The accuracy of the depth derived from expansion is determined by that of the expansion 
factor. When all the (four) parameters have converged, and thus compensated for, the case be- 
comes that of nominally zero distortion and shifts. Therefore we have to examine the sensitivity 
of the correlation peak value to residual errors in the expansion factor alone. This accuracy is 
determined by the additive noise at the peak (denoted by C/v(0, 0)). Notice that, so far, we have 
neglected this noise because it is practically much smaller than the sidelobe noise which results 
from the randomness of the image itself. The additive noise at the peak is given by equation 
(19) of [31] which is similar to (41) but with r u = t v = 0 and one of the correlation functions 
replaced by that of the noise, Rn(u, u), that is, 

/■+oo f+OO 

var{CV(0,0)} = L~ 2 / / g(u,v)R(u, v)R N (u,v)dudv (47) 


Correlation peak low 



Figure 42: Loss in correlation peak value due to residual errors in scaling factor. 


For simplicity we use equal R(t u , t u ) and Rn{t u , t u ) as given by (16). The question is now: 
what is the change in the exp ansion factor w hich causes a change in the correlation peak equal 
to the standard deviation, yvar{Cjv(0, 0)}. The correlation peak, as given by (5) of [31], is 
plotted in figure 42. For the same example used earlier, where L/A = 14, an d assuming an 
image signal-to-noise ratio of a 100, it is found from (47) that ^var{CV(0, 0)} = 0.000177. In 
the figure, the point having L/A = 14 and an ordinate of — 0.177 falls between the graphs of 
s = 0.003 and s = 0.004. Interpolating between these, results in s = 0.00325. 


The relationship between the s error and the depth error is derived from (44), where we 


had 


zo 


sV z kt 

s — 1 ’ 


(48) 


so that 


dzo ds ds 

v 

Zo s(s — 1) s — 1 


(49) 
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We can thus express the expansion-based depth standard deviation as 

v,Zo 

G Z» - 

S — 1 

For s = 1.0274, as was used to create figure 10 (z 0 = 150 m), and with a a = 0.00325, (50) yields 
o zt = 17.8 m which is close to the simulation results. 

The depth information contained in the expansion factor, s, and that contained in the shifts 
,(a, 6), is likely to be correlated because it is the same additive noise that causes inaccuracies in 
both measurements. Developing the necessary covariance matrix that relates their errors is not 
an easy task, and we thus forego that job here. However, we can still write down the combining 
algorithm for the initial-depth unbiased estimate, zb, as (see [33]) 


( 50 ) 


zo = kz a + (1 - k)z t , 


(51) 


where z a is the expansion-based depth measurement and z t the translation- (or shifts-) based 
one. k is determined by the variances, cr 2 zs of z a and o 2 t of z t , and by their correlation coefficient 
P , as 

1 &zt P®Zt G z& 

+ a lt - 2pVzt<?z a ’ 

and the minimum error — using this k — is then 



E{e ! } £E{( 2 „ -*,)’} = 


~ P 1 ) 


(53) 


We know that, close to the FOE, cr za <C Gzt so that, irrespective of p, k =£> 1, and vice 
versa. This means that, even if we use some guessed p of, say, 0.5 at this point, we will still 
be combining the measurements in a consistent way; that is the accurate measurement will 
contribute more than the inaccurate one — although, without knowing p, the proportions will 
not be optimal. 


7 CONCLUDING REMARKS 


In this paper we developed a new expansion- based passive-ranging algorithm that can com- 
plement the existing shift-based algorithm in the image areas near the FOE. We presented 
simulation and real-data results and compared them with the analysis results. 

In the future we intend to develop this algorithm in two directions. One is to make it process 
an image sequence in real time and produce range maps. The other is to use it to segment an 
image into objects. 
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